The “awk” command extracts the first and fourth columns from “SRR030834_tab.txt” and prints them separated by a tab. The output is redirected to a new text file,
“SRR030834_seq.txt” (Figure 1.8).
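To make the column numbering concrete, here is a minimal sketch on a single made-up record. The file names (one.fastq, one_tab.txt, one_seq.txt) and the read itself are hypothetical, and the header is assumed to consist of three whitespace-separated tokens. Because awk splits on any whitespace by default, those three tokens occupy fields 1 to 3 and the sequence becomes field 4.

```shell
# One made-up FASTQ record; the header has three whitespace-separated tokens.
printf '%s\n' \
  '@read1 sample length=8' \
  'ACGTACGT' \
  '+read1 sample length=8' \
  'IIIIIIII' > one.fastq

# paste joins every four input lines into one tab-separated line.
paste - - - - < one.fastq > one_tab.txt

# awk splits on spaces and tabs alike, so field 1 is the ID
# and field 4 is the sequence for this header layout.
awk '{print $1 "\t" $4}' one_tab.txt > one_seq.txt

cat one_seq.txt
```

If the header held a different number of tokens, the sequence would sit in a different field, so it is worth inspecting a few lines of the tab-joined file before choosing the column.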
Linux commands can be chained to perform multi-step operations. Suppose we want to create a
FASTA file from the FASTQ file. First, we extract the IDs and sequences into a file as we did
above. Next, we remove the “@” symbol, leaving only the IDs, and add “>” at the beginning of
each line with no space between the “>” and the ID. Finally, we split the two columns onto
separate lines, so that the ID forms the definition line (defline) of each FASTA entry and the
sequence follows on the next line, save the result in a file, and delete the temporary files.
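# Join each 4-line FASTQ record into one tab-separated line
cat SRR030834.fastq | paste - - - - \
> SRR030834_tab.tmp
# Keep the ID (field 1) and the sequence (field 4), then strip the leading "@"
awk '{print $1 "\t" $4}' SRR030834_tab.tmp \
| sed 's/^@//' > SRR030834_seq.tmp
# Prefix each line with ">" (in-place editing with -i as written assumes GNU sed)
sed -i 's/^/>/' SRR030834_seq.tmp
# Put the defline and the sequence on separate lines (no comma, so no stray space)
awk '{print $1 "\n" $2}' SRR030834_seq.tmp \
> SRR030834.fasta
# Delete only the temporary files created above
rm SRR030834_tab.tmp SRR030834_seq.tmp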
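The same conversion can also be sketched as a single awk pass that needs no temporary files. This is an alternative to the steps above, not the procedure the text follows, and it assumes every record spans exactly four lines. The file names toy.fastq and toy.fasta and the two reads are made up for illustration.

```shell
# Build a tiny two-record FASTQ file so the sketch is self-contained.
printf '%s\n' \
  '@read1 sample length=8' 'ACGTACGT' '+read1 sample length=8' 'IIIIIIII' \
  '@read2 sample length=8' 'TTGGCCAA' '+read2 sample length=8' 'IIIIIIII' \
  > toy.fastq

# In each 4-line record, line 1 is the header and line 2 is the sequence.
# On header lines, replace the leading "@" of the first token with ">"
# and print only that token; on sequence lines, print the line unchanged.
awk 'NR % 4 == 1 { sub(/^@/, ">", $1); print $1 }
     NR % 4 == 2 { print }' toy.fastq > toy.fasta

cat toy.fasta
```

Selecting lines by their position in the record avoids depending on how many tokens the header contains, which the column-based approach relies on.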
In the FASTA format, as shown in Figure 1.9, each entry contains a definition line and a
sequence. The defline begins with “>” and can contain an identifier immediately after “>”
(no whitespace in between).
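A quick sanity check of the result can be sketched as follows: the number of deflines should equal the number of FASTQ records, and no defline should have whitespace directly after “>”. The files check.fastq and check.fasta are hypothetical stand-ins for the real input and output, and the check assumes four-line FASTQ records.

```shell
# Made-up input and output files standing in for the real ones.
printf '%s\n' \
  '@r1 x length=4' 'ACGT' '+' 'IIII' \
  '@r2 x length=4' 'GGCC' '+' 'IIII' > check.fastq
printf '%s\n' '>r1' 'ACGT' '>r2' 'GGCC' > check.fasta

# A FASTQ record spans four lines, so records = lines / 4.
fastq_records=$(( $(wc -l < check.fastq) / 4 ))

# Each FASTA entry has exactly one defline starting with ">".
fasta_records=$(grep -c '^>' check.fasta)

# Count deflines with whitespace right after ">".
# grep exits non-zero when nothing matches, hence "|| true".
bad_deflines=$(grep -c '^>[[:space:]]' check.fasta || true)

echo "FASTQ records: $fastq_records"
echo "FASTA entries: $fasta_records"
echo "Deflines with whitespace after '>': $bad_deflines"
```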
FIGURE 1.8 Extracting IDs and sequence of a FASTQ file.
FIGURE 1.9 Extracting FASTA sequence from the FASTQ file.